Skip to content

Don't execute the proxy queue while adding to it from same thread. #24565

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

brendandahl
Copy link
Collaborator

test_pthread_cancel was intermittently failing with a deadlock. From the backtrace, I found we were adding to the queue which can sometimes call malloc->emscripten_yield->emscripten_proxy_execute_queue which also tries to lock the same queue on the same thread.

@brendandahl
Copy link
Collaborator Author

Still working on a more reliable test, but I wanted to try this out...

@brendandahl brendandahl marked this pull request as draft June 13, 2025 21:32
Copy link
Collaborator

@sbc100 sbc100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

Hopefully we can also figure out some way to improve the testing ?

@tlively
Copy link
Member

tlively commented Jun 16, 2025

Solution LGTM, but I forgot about the testing.

@brendandahl
Copy link
Collaborator Author

Changing to using a thread local causes failures in test_pthread_run_on_main_thread_minimal_runtime. I didn't get very far debugging this. I won't be able to work on this until June ~25th, so if anyone wants to pick it up feel free to.

sbc100 added a commit to sbc100/emscripten that referenced this pull request Jun 17, 2025
Also use `bool` rather than `int` to better convey intent.

Split out from emscripten-core#24565
sbc100 added a commit to sbc100/emscripten that referenced this pull request Jun 17, 2025
Also use `bool` rather than `int` to better convey intent.

Split out from emscripten-core#24565
sbc100 added a commit to sbc100/emscripten that referenced this pull request Jun 17, 2025
Also use `bool` rather than `int` to better convey intent.

Split out from emscripten-core#24565
sbc100 added a commit that referenced this pull request Jun 17, 2025
There was some suspicion that this might lead to a code size change but
it doesn't seem to (I suppose because bool/int doesn't have any code in
its constructor).

Also use `bool` rather than `int` to better convey intent.

Split out from #24565
sbc100 added a commit to sbc100/emscripten that referenced this pull request Jun 18, 2025
This would really have helped in at least detecting the deadlock that
was found in emscripten-core#24570 / emscripten-core#24565
sbc100 added a commit to sbc100/emscripten that referenced this pull request Jun 18, 2025
This would really have helped in at least detecting the deadlock that
was found in emscripten-core#24570 / emscripten-core#24565
sbc100 added a commit to sbc100/emscripten that referenced this pull request Jun 19, 2025
This would really have helped in at least detecting the deadlock that
was found in emscripten-core#24570 / emscripten-core#24565
sbc100 added a commit to sbc100/emscripten that referenced this pull request Jun 19, 2025
This would really have helped in at least detecting the deadlock that
was found in emscripten-core#24570 / emscripten-core#24565
brendandahl added a commit to brendandahl/emscripten that referenced this pull request Jul 2, 2025
TLS was not initialized on workers causing the tls_base address
to be 0 and write to the wrong memory location.

Fixes emscripten-core#21528 and is also needed for PR emscripten-core#24565
brendandahl added a commit that referenced this pull request Jul 2, 2025
TLS was not initialized on workers causing the tls_base address to be 0
and write to the wrong memory location.

Fixes #21528 and is also needed for PR #24565
sbc100 added a commit to sbc100/emscripten that referenced this pull request Jul 7, 2025
This would really have helped in at least detecting the deadlock that
was found in emscripten-core#24570 / emscripten-core#24565
@brendandahl brendandahl marked this pull request as ready for review July 7, 2025 23:48
test_pthread_cancel was intermittently failing with a deadlock. From the
backtrace, I found we were adding to the queue which can sometimes call
malloc->emscripten_yield->emscripten_proxy_execute_queue which
also tries to lock the same queue on the same thread.
@brendandahl brendandahl requested a review from sbc100 July 7, 2025 23:50
@brendandahl
Copy link
Collaborator Author

Unfortunately, I still haven't found a reliable way to reproduce this in a test case. Are we fine to land without?

@tlively
Copy link
Member

tlively commented Jul 8, 2025

Does this test sketch sound plausible?

  1. Main thread spawns thread 1 and spins
  2. Thread 1 asynchronously sends a set-flag message to the main thread
  • This allocates and fills a task_queues buffer of size 1.
  1. Thread 1 signals main thread to continue
  2. Main thread stops spinning and asynchronously sends a nop task to thread 1.
  • This allocates and fills a task_queues buffer of size 2.
  1. Main thread does not deadlock on the previous step
  2. Main thread asserts that the flag is not set, showing that the task queue has not been executed.

Maybe we can also hook the allocator to set flags as well and prove that the allocations happened at the expected times during the test. We do something similar when testing frees here: https://github.com/emscripten-core/emscripten/blob/main/test/pthread/test_pthread_proxying_refcount.c#L25-L28.

If that doesn't sound plausible, then testing by observing fewer flakes SGTM.

@sbc100
Copy link
Collaborator

sbc100 commented Jul 8, 2025

@tlively (I added some numbers to your steps above). Is the expectation that we could make step 4 deterministically fail without this PR?

@tlively
Copy link
Member

tlively commented Jul 8, 2025

Yes, that's what I was aiming for. I haven't tried it, though :)

sbc100 added a commit to sbc100/emscripten that referenced this pull request Jul 8, 2025
This would really have helped in at least detecting the deadlock that
was found in emscripten-core#24570 / emscripten-core#24565
sbc100 added a commit to sbc100/emscripten that referenced this pull request Jul 8, 2025
This would really have helped in at least detecting the deadlock that
was found in emscripten-core#24570 / emscripten-core#24565
sbc100 added a commit to sbc100/emscripten that referenced this pull request Jul 8, 2025
This would really have helped in at least detecting the deadlock that
was found in emscripten-core#24570 / emscripten-core#24565
sbc100 added a commit to sbc100/emscripten that referenced this pull request Jul 8, 2025
This would really have helped in at least detecting the deadlock that
was found in emscripten-core#24570 / emscripten-core#24565
sbc100 added a commit to sbc100/emscripten that referenced this pull request Jul 8, 2025
This would really have helped in at least detecting the deadlock that
was found in emscripten-core#24570 / emscripten-core#24565
sbc100 added a commit to sbc100/emscripten that referenced this pull request Jul 8, 2025
This would really have helped in at least detecting the deadlock that
was found in emscripten-core#24570 / emscripten-core#24565
sbc100 added a commit that referenced this pull request Jul 9, 2025
This would really have helped in at least detecting the deadlock that
was found in #24570 / #24565
@brendandahl brendandahl requested a review from tlively July 9, 2025 21:58
Copy link
Collaborator

@sbc100 sbc100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % test comments


int main() {
pthread_create(&looper, NULL, looper_main, NULL);
emscripten_proxy_async(emscripten_proxy_get_system_queue(), looper, run_on_looper, NULL);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this ever set should_quit before looper_main actually starts running?

Can you add some comments somewhere in this file about exactly that case we are testing for.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of assumed the thread start would be guaranteed to run before any proxy messages, but now I'm not sure. Even if it does run before, we'd still hit a deadlock since the main thread will be locked during task creation and trying to run the queue and lock again. It doesn't really matter what the other thread is doing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can have looper_main set a flag and have main wait for that flag to be set before doing the proxying to be sure. +1 to adding comments, especially about how the deadlock would occur without the bug fix.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the test a bit more and removed the looper thread and just sends the task to the same thread. This makes the test a little less realistic, but highlights it doesn't really matter what the thread is doing.

// 3. Malloc then attempts to execute and lock the already-locked queue,
// causing a deadlock.
// This test ensures our implementation prevents this re-entrant lock.
emscripten_proxy_async(emscripten_proxy_get_system_queue(), pthread_self(), task, NULL);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh nice, so you don't even need a second thread! Makes sense I guess since this is a double-lock/self-deadlock issue.

Is it valid to send work to yourself?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_pthread_proxying.c has a test for proxing to itself, so I assume so.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, but only asynchronous self-proxying is allowed, since synchronous self-proxying would obviously deadlock.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants